Annotating URLs with Query Terms: What Factors Predict Reliable Annotations?

Authors

  • Suzan Verberne
  • Max Hinne
  • Maarten van der Heijden
  • Eva D'hondt
  • Wessel Kraaij
  • Theo P. van der Weide
Abstract

A number of recent studies have investigated the relation between URLs and associated query terms from search engine log files. In [5], the query terms associated with the domain of a URL were used as features for a URL classification task. The idea is that query terms that lead to successful classification of a URL are reliable semantic descriptors of the URL content. We follow up on this work by investigating which properties of a URL and its associated query terms predict classification success. We construct a number of URL and query properties as predictors and analyze them in depth. We conclude that classification success, and thus the reliability of the query terms as URL descriptors, cannot easily be predicted from properties of the URL and the queries.
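To make the idea of query terms as domain-level features concrete, the following is a minimal sketch and not the method of [5] or of this paper. It assumes a click log of (query, clicked URL) pairs, takes the URL's network location as its "domain", and tokenizes queries on whitespace; the function name `domain_term_features` and the toy log are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def domain_term_features(click_log):
    """Aggregate query terms per URL domain from (query, clicked_url) pairs.

    Assumption: the domain is the URL's network location and queries are
    split on whitespace; the abstract does not specify either choice.
    """
    features = defaultdict(Counter)
    for query, url in click_log:
        domain = urlparse(url).netloc.lower()
        if not domain:
            continue  # skip entries whose URL has no recognizable host
        features[domain].update(query.lower().split())
    return dict(features)


# Hypothetical toy log; a real log would come from a search engine.
toy_log = [
    ("cheap flights amsterdam", "http://www.example-travel.com/flights"),
    ("amsterdam hotels", "http://www.example-travel.com/hotels"),
    ("python tutorial", "http://docs.example-code.org/tutorial"),
]
features = domain_term_features(toy_log)
```

Each domain's term counts could then serve as a bag-of-words feature vector for a classifier. Properties such as the number of distinct terms per domain are one conceivable kind of predictor, though the abstract does not list the properties actually used.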

Similar resources

TreeGrafter: phylogenetic tree-based annotation of proteins with Gene Ontology terms and other annotations

Summary: TreeGrafter is a new software tool for annotating protein sequences using annotated phylogenetic trees. Currently, the tool provides annotations to Gene Ontology terms, and PANTHER protein class, family and subfamily. The approach is generalizable to any annotations that have been made to internal nodes of a reference phylogenetic tree. TreeGrafter takes each input query protein sequen...

Personalized Query Expansion in Social Search

Social tagging systems have gained increasing popularity as a method of annotating and categorizing a wide range of different web resources. Search in social tagging systems suffers from an extreme example of the vocabulary mismatch problem encountered in traditional Information Retrieval (IR). This is due to the personalized, unrestricted vocabulary that users choose to describe and tag each r...

Annotating Genes Using Textual Patterns

Annotating genes with Gene Ontology (GO) terms is crucial for biologists to characterize the traits of genes in a standardized way. However, manual curation of textual data, the most reliable form of gene annotation by GO terms, requires significant amounts of human effort, is very costly, and cannot catch up with the rate of increase in biomedical publications. In this paper, we present GEANN,...

Active Dual Supervision: Reducing the Cost of Annotating Examples and Features

When faced with the task of building machine learning or NLP models, it is often worthwhile to turn to active learning to obtain human annotations at minimal costs. Traditional active learning schemes query a human for labels of intelligently chosen examples. However, human effort can also be expended in collecting alternative forms of annotations. For example, one may attempt to learn a text c...

Beyond Click Graph: Topic Modeling for Search Engine Query Log Analysis

Search engine query log is a valuable information source to analyze the users’ interests and preferences. In existing work, click graph is intensively utilized to analyze the information in query log. However, click graph is usually plagued by low information coverage, failure of capturing the diverse types of co-occurrence and the incapability of discovering the latent semantics in data. In th...

Publication date: 2009